In this section you will see the processes to fix the type of data, cleaning blank or null data, and the elimination of duplicates.
Fixing Data Type
We start by seeing how the ‘videogame’ data frame is composed with the Head function and with the Glimpse function it will tell us how many rows, columns, and the type of data within each column of the data frame video games. As a result, this tells us it has 12 columns and 16,622 rows. By data type, it appears numeric and mostly character.
When looking at the data type of each column in the data frame, several observations were found. First, I found that the “YEAR” column is categorized as a character when it should be an integer by its rounded number. Second, all the columns that have “SALES” in their name (NorthAmerica_Sales, Europe_Sales, RestofWorld_Sales, and Total_Sales) all their data are known as characters when it should be numeric by its decimal numbers. Finally, the columns “WON_AN_AWARD” and “GENRE” that it recognized its data as a character when it should be a factor because of their categories.
## # A tibble: 6 x 12
## Name Console Year Genre Publisher NorthAmerica_Sa… Europe_Sales Japan_Sales
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2002… PS2 year… Spor… Electron… 0.21 0.17 0.16
## 2 2002… XB year… Spor… Electron… 0.1400000000000… 0.04 0
## 3 2002… GC year… Spor… Electron… 0.04 0.01 0
## 4 2010… PS3 year… Spor… Electron… 0.3 0.64 0.07
## 5 2010… X360 year… Spor… Electron… 0.32 0.43 0.02
## 6 2010… PSP year… Spor… Electron… 0.09 0.24 0.03
## # … with 4 more variables: RestofWorld_Sales <chr>, Total_Sales <chr>,
## # Consumer_rating <dbl>, Won_an_award <chr>
## Rows: 16,622
## Columns: 12
## $ Name <chr> "2002 FIFA World Cup", "2002 FIFA World Cup", "200…
## $ Console <chr> "PS2", "XB", "GC", "PS3", "X360", "PSP", "Wii", "P…
## $ Year <chr> "year 2002", "year 2002", "year 2002", "year 2010"…
## $ Genre <chr> "Sports", "Sports", "Sports", "Sports", "Sports", …
## $ Publisher <chr> "Electronic Arts", "Electronic Arts", "Electronic …
## $ NorthAmerica_Sales <chr> "0.21", "0.14000000000000001", "0.04", "0.3", "0.3…
## $ Europe_Sales <chr> "0.17", "0.04", "0.01", "0.64", "0.43", "0.24", "0…
## $ Japan_Sales <dbl> 0.16, 0.00, 0.00, 0.07, 0.02, 0.03, 0.01, 0.01, 0.…
## $ RestofWorld_Sales <chr> "0.06", "0.01", "0", "0.22", "0.09", "0.12", "0.04…
## $ Total_Sales <chr> "0.59", "0.19", "0.05", "1.23", "0.86", "0.48", "0…
## $ Consumer_rating <dbl> 79, 40, 28, 82, 52, 74, 52, 61, 62, 75, 47, 72, 74…
## $ Won_an_award <chr> "did not win", "did not win", "did not win", "won"…
As we can see, by using the function Summary, the numeric data type will calculate the minimum, median, mean, maximum, and quartiles. On the other hand, for the character data type, it will only appear the lengths, class, and mode.
## Name Console Year Genre
## Length:16622 Length:16622 Length:16622 Length:16622
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Publisher NorthAmerica_Sales Europe_Sales Japan_Sales
## Length:16622 Length:16622 Length:16622 Min. :-123.00000
## Class :character Class :character Class :character 1st Qu.: 0.00000
## Mode :character Mode :character Mode :character Median : 0.00000
## Mean : 0.03484
## 3rd Qu.: 0.04000
## Max. : 10.22000
## RestofWorld_Sales Total_Sales Consumer_rating Won_an_award
## Length:16622 Length:16622 Min. : 0.00 Length:16622
## Class :character Class :character 1st Qu.: 17.00 Class :character
## Mode :character Mode :character Median : 36.00 Mode :character
## Mean : 38.65
## 3rd Qu.: 59.00
## Max. :100.00
Fixing the column “Year”: Why is Year not integer? We need to delete the “year” text in its data. By deleting this now will only appear the integer number of the year (Instead of “year 2002” it will only appear only “2002”). For this, we will be using the Mutate function to create a new column of the Year without “year” on it and the function Select to update the data frame with only the column that shows that Year is an integer.
videogame <- videogame %>% mutate(Year_trimmed = str_remove(Year, "year"),Years =as.integer(Year_trimmed)) %>%
select(Name, Console, Years, Genre, Publisher, NorthAmerica_Sales, Europe_Sales, Japan_Sales, RestofWorld_Sales, Total_Sales, Consumer_rating, Won_an_award)
Now it appears the minimum, median, mean, maximum, and quartiles for this new column
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1980 2003 2007 2006 2010 2020 271
Fixing the column “NorthAmerica_Sales, Europe_Sales, RestofWorld_Sales & Total_Sales”: In this case, we will do the same process to fix all the “Sales” columns in order to make them numeric instead of character. Here I use the As.numeric function to update the data frame by changing the old incorrect column to the new correct data type column.
videogame$NorthAmerica_Sales <- as.numeric(videogame$NorthAmerica_Sales)
videogame$Europe_Sales <- as.numeric(videogame$Europe_Sales)
videogame$RestofWorld_Sales <- as.numeric(videogame$RestofWorld_Sales)
videogame$Total_Sales <- as.numeric(videogame$Total_Sales)
Now we can observe that it appears the minimum, median, mean, maximum, and quartiles for all the columns that have “Sales” in their names.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.080 0.263 0.240 41.490 31
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0200 0.1547 0.1200 29.0200 638
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0100 0.0493 0.0400 10.5700 397
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.0700 0.1800 0.5504 0.4900 82.7400 381
Fixing the column “Genre”, “Won_an_award”, “Console” & “Publisher”: Here I use the As.factor function to update the data frame by changing the old incorrect column to the new correct data type column.
videogame$Won_an_award <- as.factor(videogame$Won_an_award)
videogame$Genre <- as.factor(videogame$Genre)
videogame$Console <- as.factor(videogame$Console)
videogame$Publisher <- as.factor(videogame$Publisher)
Now we see that this arrangement divided these columns into categories and counted how many are within each category.
summary(videogame$Won_an_award)
summary(videogame$Genre)
summary(videogame$Console)
summary(videogame$Publisher)
Let’s see a recap
Making a recap of what we have already seen and fixed here we use this new function (not seen in class) where it makes us a summary of all the data frame of videogames. This function is call Skim from the skimr library. This function tell us the number of columns, rows, the amount of data types we have, and a summary of each of the columns by the data type.
What this show?: 1. For character type columns, it shows us if there are missing values, min/max, unique data, and blank spaces. 2. For factor type columns, it shows us if there are missing values, unique data, and the amount of data that is per category. 3. For numeric columns, it shows us if there are missing values, complete_rate, the mean, standard deviation, p0, p25, p50, p75, p100, and a small histogram of the data.
# Need the skimr library for this function
skim(videogame)
Data summary
| Name |
videogame |
| Number of rows |
16622 |
| Number of columns |
12 |
| _______________________ |
|
| Column type frequency: |
|
| character |
1 |
| factor |
4 |
| numeric |
7 |
| ________________________ |
|
| Group variables |
None |
Variable type: character
Variable type: factor
| Console |
0 |
1 |
FALSE |
31 |
DS: 2165, PS2: 2164, PS3: 1329, Wii: 1328 |
| Genre |
0 |
1 |
FALSE |
12 |
Act: 3317, Spo: 2354, Mis: 1740, Rol: 1489 |
| Publisher |
0 |
1 |
FALSE |
582 |
Ele: 1355, Act: 976, Nam: 932, Ubi: 921 |
| Won_an_award |
0 |
1 |
FALSE |
2 |
did: 16477, won: 145 |
Variable type: numeric
| Years |
271 |
0.98 |
2006.41 |
5.83 |
1980.00 |
2003.00 |
2007.00 |
2010.00 |
2020.00 |
▁▁▃▇▂ |
| NorthAmerica_Sales |
31 |
1.00 |
0.26 |
0.82 |
0.00 |
0.00 |
0.08 |
0.24 |
41.49 |
▇▁▁▁▁ |
| Europe_Sales |
638 |
0.96 |
0.15 |
0.57 |
0.00 |
0.00 |
0.02 |
0.12 |
29.02 |
▇▁▁▁▁ |
| Japan_Sales |
0 |
1.00 |
0.03 |
1.92 |
-123.00 |
0.00 |
0.00 |
0.04 |
10.22 |
▁▁▁▁▇ |
| RestofWorld_Sales |
397 |
0.98 |
0.05 |
0.19 |
0.00 |
0.00 |
0.01 |
0.04 |
10.57 |
▇▁▁▁▁ |
| Total_Sales |
381 |
0.98 |
0.55 |
1.57 |
0.01 |
0.07 |
0.18 |
0.49 |
82.74 |
▇▁▁▁▁ |
| Consumer_rating |
0 |
1.00 |
38.65 |
25.61 |
0.00 |
17.00 |
36.00 |
59.00 |
100.00 |
▇▇▇▅▂ |
Cleaning the Data
Since we have the correct data types in each column, it is time to clean up the data frame. For this, we will be checking all duplicates, out of range values, and replace null values that may exist in the data.
Replacing NA for the Median: Now that we have the correct data type for the Years columns, it means that now it can calculate all dispersion measures and the amount of NA found in this column. As we can see it found 271 NA’s in these columns. To replace all the NA found here we are going to replace it with its median. As you can see in this first chunk these are all the calculation and count of NA in each column.This step is to check the before and after.
summary(videogame$Years)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1980 2003 2007 2006 2010 2020 271
Replacing NA for the median in the column -> Years
videogame$Years[which(is.na(videogame$Years))] = 2007
Now here we can see it doesn’t appear anymore NAs
summary(videogame$Years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1980 2003 2007 2006 2010 2020
Replacing NA for the Mean:
summary(videogame$NorthAmerica_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.080 0.263 0.240 41.490 31
summary(videogame$Europe_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0200 0.1547 0.1200 29.0200 638
summary(videogame$RestofWorld_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0100 0.0493 0.0400 10.5700 397
summary(videogame$Total_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0100 0.0700 0.1800 0.5504 0.4900 82.7400 381
The process is the same for all columns. First we call the original data frame with the column to replace NA. Here we use the WHICH and IS.NA function to designate that where there is NA in the column X in the data frame Video Games should change it to the average. (The number is the average).
Replacing NA for the mean in the column -> NorthAmerica_r_Sale,Europe_r_Sales, RestWorld_r_Sales & video_games1$Total_r_Sales
videogame$NorthAmerica_Sales[which(is.na(videogame$NorthAmerica_Sales))] = 0.26
videogame$Europe_Sales[which(is.na(videogame$Europe_Sales))] = 0.15
videogame$RestofWorld_Sales[which(is.na(videogame$RestofWorld_Sales))] = 0.05
videogame$Total_Sales[which(is.na(videogame$Total_Sales))] = 0.55
Now here we can see it doesn’t appear anymore NAs
summary(videogame$NorthAmerica_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.080 0.263 0.240 41.490
summary(videogame$Europe_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0300 0.1545 0.1400 29.0200
summary(videogame$RestofWorld_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.01000 0.04927 0.04000 10.57000
summary(videogame$Total_Sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0700 0.1900 0.5504 0.5300 82.7400